Currently, our models for low-prevalence species are unlikely to be very good because the models are heavily overfitted. One way of dealing with the low abundance of certain species is to take a species' prevalence across the entire dataset into account when deciding how many pseudoabsences to generate. To do this, while still accounting for sampling biases, we take the number of unique grid references in which a species has been seen and the number of unique grid references across the whole dataset. The proportion of cells that a species of interest appears in then determines how many pseudoabsences are generated: if a species is seen in 20% of cells, we generate 4x more pseudoabsences than presences ((1 − 0.2)/0.2 = 4). This should mean that the predicted probability of a species being present at any given location decreases, because every species appears in <50% of cells in the original dataset, and hopefully leads to more realistic models of presence. I am not sure how this interacts with random forest models: they will be trained on many more ‘absences’ than ‘presences’, which could lead to them only predicting absences.
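To make the scaling concrete, here is a minimal sketch of the pseudoabsence count implied by a given prevalence (the function name and the counts are made up for illustration; the real calculation appears further down):

```r
# Pseudoabsences scale as presences * (1 - prev) / prev, so a species seen in
# 20% of sampled cells gets 4x as many pseudoabsences as presences.
n_pa_for_prev <- function(n_pres, prev) {
  round(n_pres * (1 - prev) / prev)
}

n_pa_for_prev(100, 0.2)  # 400 pseudoabsences for 100 presences
n_pa_for_prev(100, 0.5)  # 100: a species in half the cells gets matched numbers
```

Note that at 50% prevalence this reduces to the matched presence/absence design, which is why the prevalence-scaled designs only ever add pseudoabsences relative to the ‘original way’.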

In order to account for prevalence we first need to reduce the data to a single presence point per species per grid cell. At the moment, the data contain multiple sightings of a species in the same grid cell when it was seen on different days of the year (we have only removed repeat sightings from the same day in the same grid cell). If we scale the number of pseudoabsences relative to the number of repeat sightings, we will end up with a hyper-inflated number of absence points for certain species, e.g. species that are common in a very restricted area. If we do want to use prevalence together with the repeat-sightings data, there might be a way to scale numbers by how many sightings appear in the dataset as well as by prevalence.

So, I am going to run four logistic regression models for each species:

  1. duplicated data with equal numbers of presence and absence points (‘original way’)

  2. duplicated data with pseudoabsences scaled according to prevalence (‘original + prevalence’)

    • I don’t actually run this model because it takes too long and crashes my computer. I explain below why this isn’t the best method anyway.
  3. thinned data with matched numbers of presences and absences (‘thinned + matched’)

  4. thinned data with pseudoabsences scaled according to prevalence (‘thinned + prevalence’)

Environmental data

##  [1] "sea"                   "broad_wood"            "conif_wood"           
##  [4] "arable"                "impr_grass"            "neutr_grass"          
##  [7] "calc_grass"            "acid_grass"            "fen_marsh_swamp"      
## [10] "heather"               "heath_grass"           "bog"                  
## [13] "inland_rock"           "saltwater"             "freshwater"           
## [16] "sup_lit_rock"          "sup_lit_sed"           "lit_rock"             
## [19] "lit_sed"               "saltmarsh"             "urban"                
## [22] "suburban"              "AnnualTemp"            "MeanDiRange"          
## [25] "Isotherm"              "TempSeasonality"       "MaxTempWarmestMonth"  
## [28] "MinTempColdestMonth"   "TempAnnualRange"       "MeanTempWetQuarter"   
## [31] "MeanTempDriestQuarter" "MeanTempWarmQuarter"   "MeanTempColdQuarter"  
## [34] "AnnualPrecip"          "PrecipWetMonth"        "PrecipDriestMonth"    
## [37] "PrecipSeasonality"     "PrecipWettestQuarter"  "PrecipDriestQuarter"  
## [40] "PrecipWarmQuarter"     "PrecipColdQuarter"     "elevation_UK"         
## [43] "slope"                 "aspect"

Crop and drop correlated variables

Get an area of interest and drop correlated weather variables (display the ones that are kept). Limiting it to a northern-ish part of England.
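The output below looks like `usdm::vifstep`; as a rough, self-contained sketch of the underlying idea (greedily dropping the variable most correlated with the others until all pairwise correlations fall below a threshold), here is a hand-rolled version on simulated data with hypothetical variable names:

```r
set.seed(1)
# simulate three nearly collinear 'precip' variables and one independent 'temp' variable
n <- 500
AnnualPrecip      <- rnorm(n)
PrecipWetMonth    <- AnnualPrecip + rnorm(n, sd = 0.1)
PrecipSeasonality <- AnnualPrecip + rnorm(n, sd = 0.2)
TempSeasonality   <- rnorm(n)
env <- data.frame(AnnualPrecip, PrecipWetMonth, PrecipSeasonality, TempSeasonality)

drop_correlated <- function(df, th = 0.7) {
  repeat {
    cm <- abs(cor(df)); diag(cm) <- 0
    if (max(cm) < th) break
    # drop the variable most correlated with the others on average
    worst <- names(which.max(colMeans(cm)))
    df <- df[, setdiff(names(df), worst), drop = FALSE]
  }
  df
}

kept <- drop_correlated(env)
names(kept)  # the retained, weakly correlated variables
```

This is only an illustration of the principle; `vifstep` uses variance inflation factors rather than raw pairwise correlations, so the exact set of dropped variables can differ.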

## 11 variables from the 16 input variables have collinearity problem: 
##  
## AnnualPrecip PrecipWettestQuarter PrecipDriestQuarter PrecipWetMonth PrecipColdQuarter MinTempColdestMonth PrecipDriestMonth MeanTempWarmQuarter MaxTempWarmestMonth TempAnnualRange PrecipSeasonality 
## 
## After excluding the collinear variables, the linear correlation coefficients ranges between: 
## min correlation ( MeanTempColdQuarter ~ MeanTempDriestQuarter ):  0.1187639 
## max correlation ( PrecipWarmQuarter ~ TempSeasonality ):  -0.6616779 
## 
## ---------- VIFs of the remained variables -------- 
##               Variables      VIF
## 1       TempSeasonality 2.305406
## 2    MeanTempWetQuarter 1.920244
## 3 MeanTempDriestQuarter 1.389453
## 4   MeanTempColdQuarter 2.579416
## 5     PrecipWarmQuarter 3.278745
##  [1] "sea"                   "broad_wood"            "conif_wood"           
##  [4] "arable"                "impr_grass"            "neutr_grass"          
##  [7] "calc_grass"            "acid_grass"            "fen_marsh_swamp"      
## [10] "heather"               "heath_grass"           "bog"                  
## [13] "inland_rock"           "saltwater"             "freshwater"           
## [16] "sup_lit_rock"          "sup_lit_sed"           "lit_rock"             
## [19] "lit_sed"               "saltmarsh"             "urban"                
## [22] "suburban"              "AnnualTemp"            "MeanDiRange"          
## [25] "Isotherm"              "TempSeasonality"       "MeanTempWetQuarter"   
## [28] "MeanTempDriestQuarter" "MeanTempColdQuarter"   "PrecipWarmQuarter"    
## [31] "elevation_UK"          "slope"                 "aspect"

Selecting subset of data

Plot number of data points in the whole of the UK and for the region of interest that I’ve chosen. Can see that Tyria jacobaeae is the most recorded species at both extents.

## # A tibble: 6 x 25
##      X1  X1_1 TO_ID TO_STARTDATE TO_ENDDATE DT_ID TO_GRIDREF TO_LOCALITY GR_ID
##   <dbl> <dbl> <dbl> <chr>        <chr>      <dbl> <chr>      <chr>       <dbl>
## 1     1   165  1410 13/07/2005   13/07/2005     1 NM430268   Isle of Mu~     1
## 2     2   166  1411 13/07/2005   13/07/2005     1 NM430268   Isle of Mu~     1
## 3     3  1466  6557 02/07/2005   02/07/2005     1 SK322596   Ash Brook,~     1
## 4     4  1937 10384 16/07/2002   16/07/2002     1 SV918113   St Mary's ~     1
## 5     5  2393 13520 21/07/2004   21/07/2004     1 SK440660   Holmewood ~     1
## 6     6  2394 13521 21/07/2004   21/07/2004     1 SK440660   Holmewood ~     1
## # ... with 16 more variables: VC_ID <dbl>, SQ_10 <chr>, CONCEPT <chr>,
## #   TO_PRECISION <dbl>, LATITUDE <dbl>, LONGITUDE <dbl>, lon <dbl>, lat <dbl>,
## #   date <date>, yr <dbl>, jd <dbl>, sp_n <chr>, com_n <chr>, ig <chr>,
## #   ag <chr>, id <chr>
## [1] FALSE

Thinning records

Thin the presence records so that each species has only one record per unique location (grid reference) in the data.

thinned_occ <- ndf %>% dplyr::select(lon, lat, sp_n, TO_GRIDREF) %>%  
  group_by(sp_n, TO_GRIDREF) %>% 
  distinct() %>% 
  mutate(species = sp_n,
         year = 2015) # need a year column for Rob's function to work, just a dummy year

Calculating prevalence

Calculate the prevalence of each species to use in selecting the number of pseudoabsences. First, get the number of unique grid cells and the number of records for each species. Second, get the prevalence by dividing the number of unique cells each species appears in by the number of unique cells across all species (i.e. the total number of cells sampled). Third, remove all duplicate rows so there is one species per row. Fourth, calculate the number of pseudoabsences needed for the duplicated data and the thinned data by dividing the number of records (or cells) by the percentage prevalence, n_records / (prev * 100), and multiplying by the percentage of cells the species wasn’t in, (1 - prev) * 100; this simplifies to n_records * (1 - prev) / prev.

# length(unique(sp_y$TO_GRIDREF))

sp_y_prev <- sp_y %>% group_by(sp_n) %>% 
  mutate(n_cells = length(unique(TO_GRIDREF)), # number unique locations
         n_records = length(TO_GRIDREF)) %>%  # total number of records
  ungroup() %>% 
  mutate(prev = n_cells/length(unique(TO_GRIDREF))) %>% # get the prevalence
  dplyr::select(sp_n, n_cells, n_records, prev) %>% # select only columns of interest
  distinct() %>% # one row per species 
  mutate(dup_rec_PA = round(n_records/(prev*100) *((1-prev)*100), 0), # calculate PA from duplicated records
         thin_rec_PA = round(n_cells/(prev*100) *((1-prev)*100), 0)) # calculate PA from single records

sp_y_prev$dup_rec_PA  <- ifelse(sp_y_prev$dup_rec_PA > 6664, 6664, sp_y_prev$dup_rec_PA) # some PAs are too high for duplicated records

head(arrange(sp_y_prev, -n_cells))
## # A tibble: 6 x 6
##   sp_n                 n_cells n_records   prev dup_rec_PA thin_rec_PA
##   <chr>                  <int>     <int>  <dbl>      <dbl>       <dbl>
## 1 Tyria jacobaeae         1605      6664 0.403        6664        2376
## 2 Odezia atrata            580       983 0.146        5764        3401
## 3 Zygaena lonicerae        572       825 0.144        4917        3409
## 4 Chiasmia clathrata       572      2067 0.144        6664        3409
## 5 Zygaena filipendulae     501       901 0.126        6258        3480
## 6 Ematurga atomaria        394       662 0.0990       6027        3587

The code for getting pseudoabsences from duplicated data while accounting for prevalence generated an error when the number of pseudoabsences was too high for certain species. Therefore, I capped the number of pseudoabsences at the maximum number of records, which was 6664 (for Tyria jacobaeae). This is a quick and dirty fix but, as I outlined above, this method doesn’t make much sense in the grand scheme of things anyway.

I’m going to get the absences/pseudoabsences on all of the species but then run the logistic regression models on only a subset of species because the code seems to crash R all the time.

Generate pseudoabsences and run models

Now we need to generate the four different pseudoabsence data frames and run the logistic regression models, with each set of models run immediately after its pseudoabsence generation. To speed things up, I am only going to run the models for a subset of 6 species (Orgyia antiqua, Perizoma albulata, Tyria jacobaeae, Euclidia mi, Zygaena filipendulae, Chiasmia clathrata). However, the pseudoabsences and prevalence are calculated using all species.

The ‘original way’

Pseudoabsences

Full dataset (with duplicated records) and matched number of pseudoabsences.

#### Create presence/absence
spp <- unique(sp_y$species) ## all species
spp_sub <- c('Orgyia antiqua', 'Perizoma albulata', 'Tyria jacobaeae', 'Euclidia mi', 'Zygaena filipendulae', 'Chiasmia clathrata')

# spp_sub %in% spp

registerDoParallel(cores = detectCores() - 1)

system.time( 
  ab1 <- foreach(i = 1 : length(spp)) %dopar% {
    
    cpa(spdat = ndf, species = spp[i], matchPres = TRUE,
        minYear = 2000, maxYear = 2017, recThresh = 5)
    
  }
)
##    user  system elapsed 
##    0.13    0.12    0.82
registerDoSEQ()

names(ab1) <- spp


# ab1 <- lapply(spp, FUN = function(x) cpa(spdat = ndf, species = x, matchPres = TRUE,
#         minYear = 2000, maxYear = 2017, recThresh = 5))

LR model

Takes ~2 mins to run with only 6 species.

# do all species in parallel to save time
registerDoParallel(cores = detectCores() - 1)


system.time(
  lr_org <- foreach(s = 1:length(spp_sub), #.combine='comb', .multicombine=TRUE,
                    # .init=list(list()), 
                    .packages = c('tidyverse', 'raster', 'readr', 'dismo'),
                    .errorhandling = 'pass') %dopar% {
                      
                      # print(paste(s, spp[s], sep = " "))
                      
                      if(is.null(ab1[[spp_sub[s]]])){ # look up by name: s indexes spp_sub, not spp
                        # print(paste("No data for species", spp_sub[s]))
                        return(NULL) # `next` is not valid inside a foreach %dopar% body
                      }
                      
                      sdm_lr <- fsdm(species = spp_sub[s], model = "lr", 
                                     climDat = ht, spData = ab1, knots = -1,
                                     k = 5,
                                     prediction = T,
                                     inters = F, 
                                     write =  F, outPath = "C:/Users/thoval/Documents/Analyses/lr_outs/")
                      
                      se_out <- predict(ht, sdm_lr$Model, type = "response", se.fit = TRUE, index = 2)
                      # save(se_out, file = paste0("C:/Users/thoval/Documents/Analyses/lr_outs/", spp[s], "_lr_se_error_prediction.rdata"))
                      
                      list(sdm_lr, se_out)
                    }
)
##    user  system elapsed 
##    2.60    1.93  137.62
registerDoSEQ()

# lapply(spp_sub, FUN = function(x) {
#   
#   sdm_lr <- fsdm(species = x, model = "lr", 
#                  climDat = ht, spData = ab1, knots = -1,
#                  k = 5,
#                  prediction = T,
#                  inters = F, 
#                  write =  F, outPath = "C:/Users/thoval/Documents/Analyses/lr_outs/")
#   se_out <- predict(ht, sdm_lr$Model, type = "response", se.fit = TRUE, index = 2)
#   list(sdm_lr, se_out)
# }
# )

The output of this function is a nested list: the outer list has one element per species, and each inner list contains the output from the fsdm function in [[1]] and the standard-error prediction in [[2]]. For models that do not produce a standard-error prediction there will be no second element.
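As a toy illustration of that structure (mock objects standing in for the real fsdm output; the species names and AUC values are invented), element [[i]][[1]] is the model output for species i and [[i]][[2]] its standard-error prediction:

```r
# mock of the nested output: one outer element per species,
# inner [[1]] = fsdm-style output, inner [[2]] = SE prediction (may be absent)
lr_mock <- list(
  list(list(Species = "Tyria jacobaeae", AUC = 0.81), "se_prediction_placeholder"),
  list(list(Species = "Euclidia mi", AUC = 0.74))  # no SE prediction produced
)

lr_mock[[1]][[1]]$Species  # model output for the first species
length(lr_mock[[2]])       # 1: the second species has no SE entry
```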

‘Original + prevalence’

Pseudoabsences

The number of pseudoabsences is determined by multiplying the number of records by (1 − prevalence)/prevalence for each species, as described above.

Generating pseudoabsences from data containing duplicate records of the same species in the same cell over time doesn’t make sense, because some species would get a hyper-inflated number of PA points (e.g. species that are quite common in a very restricted area). It also takes too long and crashes the computer, so I’m going to leave it for now, but the code is ready. At some point I still intend to run it to see what difference it makes to the SDMs.
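For reference, the sampling itself is conceptually simple. Here is a minimal, hypothetical stand-in for what the cpa call would do in this mode (the function name, cell IDs, and counts are all made up for illustration): draw nAbs cells from the sampled cells where the species was never recorded.

```r
# hypothetical stand-in for prevalence-scaled pseudoabsence sampling:
# draw nAbs cells (with replacement only if the background is too small)
# from sampled cells with no record of the species
sample_pseudoabsences <- function(all_cells, presence_cells, nAbs) {
  background <- setdiff(all_cells, presence_cells)
  sample(background, nAbs, replace = nAbs > length(background))
}

set.seed(42)
all_cells <- sprintf("SK%03d", 1:100)  # 100 fake grid references
pres <- all_cells[1:20]                # species 'seen' in 20% of cells
pa <- sample_pseudoabsences(all_cells, pres, nAbs = 80)  # 4x the presences
length(pa)
any(pa %in% pres)  # FALSE: no pseudoabsence falls in a presence cell
```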

LR model

‘Thinned + matched’

Pseudoabsences

Thinned data and equal number of presences and absences.

registerDoParallel(cores = detectCores() - 1)

system.time( 
  ab1_thin_match <- foreach(i = 1 : length(spp)) %dopar% {
    
    cpa(spdat = thinned_occ, species = spp[i], matchPres = TRUE,
        minYear = 2000, maxYear = 2017, recThresh = 5)
    
  }
)
##    user  system elapsed 
##    0.08    0.01    0.36
registerDoSEQ()

names(ab1_thin_match) <- spp

LR model

# do all species in parallel to save time
registerDoParallel(cores = detectCores() - 1)


system.time(
  lr_thin_match <- foreach(s = 1:length(spp_sub), #.combine='comb', .multicombine=TRUE,
                           # .init=list(list()), 
                           .packages = c('tidyverse', 'raster', 'readr', 'dismo'),
                           .errorhandling = 'pass') %dopar% {
                             
                             # print(paste(s, spp[s], sep = " "))
                             
                             if(is.null(ab1_thin_match[[spp_sub[s]]])){ # look up by name: s indexes spp_sub, not spp
                               # print(paste("No data for species", spp_sub[s]))
                               return(NULL) # `next` is not valid inside a foreach %dopar% body
                             }
                             
                             sdm_lr <- fsdm(species = spp_sub[s], model = "lr", 
                                            climDat = ht, spData = ab1_thin_match, knots = -1,
                                            k = 5,
                                            prediction = T,
                                            inters = F, 
                                            write =  F, outPath = "C:/Users/thoval/Documents/Analyses/lr_outs/")
                             
                             se_out <- predict(ht, sdm_lr$Model, type = "response", se.fit = TRUE, index = 2)
                             # save(se_out, file = paste0("C:/Users/thoval/Documents/Analyses/lr_outs/", spp[s], "_lr_se_error_prediction.rdata"))
                             
                             list(sdm_lr, se_out)
                           }
)
##    user  system elapsed 
##    2.22    1.79  146.96
registerDoSEQ()

‘Thinned + prevalence’

Pseudoabsences

Thinned data, with the number of pseudoabsences determined by multiplying the number of unique cells by (1 − prevalence)/prevalence for each species.

registerDoParallel(cores = detectCores() - 1)

system.time( 
  ab1_thin_prev <- foreach(i = 1 : length(spp)) %dopar% {
    
    cpa(spdat = thinned_occ, species = spp[i], matchPres = FALSE,
        nAbs = sp_y_prev$thin_rec_PA[sp_y_prev$sp_n == spp[i]],
        minYear = 2000, maxYear = 2017, recThresh = 5)
    
  }
)
##    user  system elapsed 
##    0.10    0.03    0.47
registerDoSEQ()

names(ab1_thin_prev) <- spp

LR model

# do all species in parallel to save time
registerDoParallel(cores = detectCores() - 1)

system.time(
  lr_thin_prev <- foreach(s = 1:length(spp_sub), #.combine='comb', .multicombine=TRUE,
                          # .init=list(list()), 
                          .packages = c('tidyverse', 'raster', 'readr', 'dismo'),
                          .errorhandling = 'pass') %dopar% {
                            
                            # print(paste(s, spp[s], sep = " "))
                            
                            if(is.null(ab1_thin_prev[[spp_sub[s]]])){ # look up by name: s indexes spp_sub, not spp
                              # print(paste("No data for species", spp_sub[s]))
                              return(NULL) # `next` is not valid inside a foreach %dopar% body
                            }
                            
                            sdm_lr <- fsdm(species = spp_sub[s], model = "lr", 
                                           climDat = ht, spData = ab1_thin_prev, knots = -1,
                                           k = 5,
                                           prediction = T,
                                           inters = F, 
                                           write =  F, outPath = "C:/Users/thoval/Documents/Analyses/lr_outs/")
                            
                            se_out <- predict(ht, sdm_lr$Model, type = "response", se.fit = TRUE, index = 2)
                            # save(se_out, file = paste0("C:/Users/thoval/Documents/Analyses/lr_outs/", spp[s], "_lr_se_error_prediction.rdata"))
                            
                            list(sdm_lr, se_out)
                          }
)
##    user  system elapsed 
##    4.15    4.55  172.70
registerDoSEQ()

Plotting the results

Plot the results for each of the modelled species.

As you can see, thinning the data to one presence point per grid cell does affect the predicted distributions (plots in the first column vs those in the second), although how much, and in which direction, it affects them seems to differ between species (even within this very small subset). Thinning the data and choosing pseudoabsences based on prevalence has a drastic impact on the probability of presence: increasing the number of pseudoabsences directly reduces the predicted probability of occurrence. I think this is because it makes the model ‘expect’ the species to be less prevalent across the whole area of interest. This makes sense for the rarest species, as they are not expected to occur in 50% of cells. For some species, e.g. Perizoma albulata, however, it makes the model think the species is unlikely to be present in any of the cells, even though it clearly is based on the raw data.

par(mfrow = c(1,3))

for(i in 1:length(lr_org)){
  
  plot(lr_org[[i]][[1]]$Predictions, main = paste(lr_org[[i]][[1]]$Species, 'Original way',sep="\n"))
  points(x = lr_org[[i]][[1]]$Data$lon[lr_org[[i]][[1]]$Data$val == 1],
         lr_org[[i]][[1]]$Data$lat[lr_org[[i]][[1]]$Data$val == 1],
         cex = 0.6, col = "red", pch = 20)
  
  plot(lr_thin_match[[i]][[1]]$Predictions, main = paste(lr_thin_match[[i]][[1]]$Species, 'Thinned + matched',sep="\n"))
  points(x = lr_org[[i]][[1]]$Data$lon[lr_org[[i]][[1]]$Data$val == 1],
         lr_org[[i]][[1]]$Data$lat[lr_org[[i]][[1]]$Data$val == 1],
         cex = 0.6, col = "red", pch = 20)
  
  plot(lr_thin_prev[[i]][[1]]$Predictions, main = paste(lr_thin_prev[[i]][[1]]$Species, 'Thinned + prevalence',sep="\n"))
  # points(x = lr_org[[i]][[1]]$Data$lon[lr_org[[i]][[1]]$Data$val == 1],
  #        lr_org[[i]][[1]]$Data$lat[lr_org[[i]][[1]]$Data$val == 1],
  #        cex = 0.6, col = "red", pch = 20)
  
}

par(mfrow = c(1,1))